Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build transforms wheel #493

Merged
merged 35 commits into from
Sep 24, 2024
Merged

Build transforms wheel #493

merged 35 commits into from
Sep 24, 2024

Conversation

touma-I
Copy link
Collaborator

@touma-I touma-I commented Aug 9, 2024

Why are these changes needed?

Allow users to do pip install of all the transforms without having to clone the project.
This process clones the project behind the scene and installs the various transforms.
Also account for the transforms that do not run on MacOS

Related issue number (if any).

Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a

  1. Makefile in each new directory and support the build, test, publish, clean and set-versions targets.
  2. A readme in each directory.
  3. Extra credit for a pypi-specific readme

TRANSFORMS_NAMES = code/code_quality \
code/code2parquet \
code/header_cleanser \
code/code_quality \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code2parquet, malware

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malware transforms need some additional work to be included as a pip install and was intentionally excluded from this initial release... It seem to require a docker container and does not fit in the current use case for running these transforms in a notebook.

* [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/README.md)
* header_cleanser (Not available on MacOS)
* code_quality
* proglang_select
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malware

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malware transforms need some additional work to be included as a pip install and was intentionally excluded from this initial release... It seem to require a docker container and does not fit in the current use case for running these transforms in a notebook.

* proglang_select
* header_cleanser (Not available on MacOS)
* code_quality
* repo_level_ordering
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malware

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

malware transforms need some additional work to be included as a pip install and was intentionally excluded from this initial release... It seem to require a docker container and does not fit in the current use case for running these transforms in a notebook.

* profiler
* doc_id
* filter
* resize
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

profiler,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky what is meant by this comment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing transform

Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need a recursive Makefile in transforms/packaging directory.

transforms/packaging/python/Makefile Outdated Show resolved Hide resolved
transforms/packaging/python/pyproject.toml Outdated Show resolved Hide resolved
transforms/packaging/python/pyproject.toml Outdated Show resolved Hide resolved
@@ -0,0 +1,53 @@
[project]
name = "data_prep_toolkit_transforms_ray"
version = "0.2.1.dev1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dev0

@touma-I touma-I requested a review from daw3rd September 11, 2024 20:08
@daw3rd
Copy link
Member

daw3rd commented Sep 13, 2024

@touma-I I think you will also need to update the .github/workflows/test.yml to add a job for this packaging directory.

@touma-I touma-I requested a review from daw3rd September 23, 2024 18:57
@touma-I touma-I marked this pull request as ready for review September 23, 2024 18:58
Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to modify scripts/check-transform-workflows.sh to not include the packaging directory.

.github/workflows/test-packaging-python.yml Outdated Show resolved Hide resolved
@touma-I touma-I requested a review from roytman September 24, 2024 09:06
Copy link
Collaborator

@shivdeep-singh-ibm shivdeep-singh-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -480,7 +480,8 @@ endif
if [ -e requirements.txt ]; then \
echo Installing requirements from requirements.txt; \
pip install $(PIP_INSTALL_EXTRA_ARGS) $$extra_url -r requirements.txt; \
elif [ -e pyproject.toml ]; then \
fi; \
if [ -e pyproject.toml ]; then \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you want to allow dependencies installation from both requirements.txt and pyproject.toml. It can confuse.
I'd add a WARNING message if both files exist. And specify in the message the installation order: first from requirements.txtand after that frompyproject.toml`

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @roytman . The situation we ran into is the pyproject.toml for the single package is using requirements.txt that was compiled from the different dependencies listed by all transforms. I will also be changing all the transforms pyproject.toml to use requirements.txt to list their dependencies. But agree, I need to watch this one closely.

@@ -587,6 +588,18 @@ MINIO_ADMIN_PWD= localminiosecretkey
> tt.toml; \
mv tt.toml pyproject.toml; \
fi
@if [ -e requirements.txt ]; then \
cat requirements.txt | sed \
-e 's/data-prep-toolkit-ray\([=><~][=]\).*/data-prep-toolkit-ray\1$(DPK_LIB_VERSION)/' \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can all these sed operations be replaced by a single macro? the lines 582-587 are almost identical to lines 593-599.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Almost but not exactly. But agree, I will take a deeper look in a followup PR.

@@ -1,6 +1,6 @@
## Data prep kit
data-prep-toolkit-transforms==0.2.1.dev1
data-prep-toolkit-transforms-ray==0.2.1.dev1
#data-prep-toolkit-transforms==0.2.1.dev1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that these are comments, but should they be with 0.2.1.dev3 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this notebook will still work with dev3. There were changes to the transforms that broke the notebook. I think Sunjee has a new release he will be checking in soon.

@@ -0,0 +1,5 @@
**/src
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we combine it (except src) with the .gitignore in the root directory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work to. I think there may be situations in the future where we want to. add the wheel for specific transforms to git so folks can download it from git. This is still under discussion but yes, once we have consensus, we should move the gitignore to higher level as appropriate.



````
cd package-release/transforms/packaging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think users are in package-release, so the command should be cd transforms/packaging

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done! Thanks

This procedure will run all the UT for each individual transforms using a single package configuration:

````
cd package-release/transforms/packaging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, or write something like, if you are not in the package-release/transforms/packaging, cd into it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks

This procedure will buid two wheels: one for the python transforms and one for the ray transforms.

````
cd package-release/transforms/packaging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks

* [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/resize/python/README.md)
* [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/tokenization/doc_chunk/python/README.md)
* [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/doc_id/python/README.md)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cannot keep this list up to date, so it might be worth adding something like the list includes some of the transformers but doesn't all of them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks

@touma-I
Copy link
Collaborator Author

touma-I commented Sep 24, 2024

I think you may need to modify scripts/check-transform-workflows.sh to not include the packaging directory.

@touma-I touma-I closed this Sep 24, 2024
@touma-I touma-I reopened this Sep 24, 2024
daw3rd
daw3rd previously requested changes Sep 24, 2024
- "*"
paths:
- "transforms/packaging/python/**"
- "data-processing-lib/**"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't include the transforms, you should probable not include the core library either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i pushed a fix for this

- "*"
paths:
- "transforms/packaging/ray/**"
- "data-processing-lib/**"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, library not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i pushed a fix for this.

@@ -0,0 +1,60 @@
REPOROOT=../../
# Use make help, to see the available rules
include ../../.make.defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should include $(REPOROOT)/.make.defaults and in general use REPOROOT

# code/repo_level_ordering # Missing implementation


TRANSFORMS_NAMES = code/code_quality \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should build this list automatically by looking for the python directories in transforms//

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this file being created by hand? is the plan to eventually automate this file.

@touma-I touma-I dismissed daw3rd’s stale review September 24, 2024 16:14

Let's address those in a separate PR. Thanks

Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i fixed workflow paths. others are minor comments.

@touma-I touma-I merged commit f5d9680 into dev Sep 24, 2024
77 checks passed
@touma-I touma-I deleted the build-transforms-wheel branch September 24, 2024 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants